13 research outputs found

    Specialization and reconfiguration of lightweight mobile processors for data-parallel applications

    The worldwide use of mobile devices has made low-power mobile processors the leading segment of the computer industry. Customers demand low-cost, high-performance and energy-efficient mobile devices that execute sophisticated mobile applications such as multimedia and 3D games. State-of-the-art mobile devices already utilize chip multiprocessors (CMP) with dedicated accelerators that exploit data-level parallelism (DLP) in these applications. Such heterogeneous system designs enable mobile processors to deliver the desired performance and efficiency. The heterogeneity, however, increases the processors' complexity and manufacturing cost, because the accelerators require extra special-purpose hardware. In this thesis, we propose new hardware techniques that leverage the available resources of a mobile CMP to achieve cost-effective acceleration of DLP workloads. Our techniques are inspired by classic vector architectures and the latest reconfigurable architectures, both of which achieve high power efficiency when running DLP workloads. The large amount of additional resources that these two architectures require limits their applicability beyond high-performance computers. To bring their advantages to mobile devices, we propose techniques that: 1) specialize the lightweight mobile cores for classic vector execution of DLP workloads; 2) dynamically tune the number of cores used for the specialized execution; and 3) reconfigure the bulk of the existing general purpose execution resources into a compute hardware accelerator. Specialization enables one or more cores to process configurable, large vector operands with new special-purpose vector instructions. Reconfiguration goes one step further and allows the compute hardware in mobile cores to dynamically implement the entire functionality of diverse compute algorithms.
The proposed specialization and reconfiguration techniques are applicable to a diverse range of general purpose processors available in today's mobile devices. However, we chose to implement and evaluate them on a lightweight processor based on the Explicit Data Graph Execution architecture, which we find promising for research on low-power processors. The implemented techniques improve the performance of the mobile processor and the efficiency of its existing general purpose resources. The processor with the specialization/reconfiguration techniques enabled efficiently exploits DLP without the extra cost of special-purpose accelerators.

    Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi

    Graph500 is a data-intensive application for high performance computing, and it is an increasingly important workload because graphs are a core part of most analytic applications. So far, no work has examined whether Graph500 is suitable for vectorization, mostly due to a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released by Intel with new features such as a wide 512-bit vector unit and vector scatter/gather instructions. Thus, the Xeon Phi allows for more efficient parallelization of Graph500 that is combined with vectorization. In this paper we vectorize Graph500 and analyze the impact of vectorization and prefetching on the Xeon Phi. We also show that the combination of parallelization, vectorization and prefetching yields a speedup of 27% over a parallel version with prefetching that does not leverage the vector capabilities of the Xeon Phi. The research leading to these results has received funding from the European Research Council under the European Union's 7th FP (FP/2007-2013) / ERC GA n. 321253. It has been partially funded by the Spanish Government (TIN2012-34557). Peer reviewed. Postprint (published version).
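To illustrate why scatter/gather instructions matter for this workload, here is a minimal sketch of one breadth-first search level (the core Graph500 kernel) written in a lane-by-lane style. It is not the paper's implementation: the CSR layout, the function names, and the pure-Python `gather`/`scatter` helpers are assumptions made for illustration, and `VLEN = 16` simply reflects sixteen 32-bit elements in a 512-bit vector.

```python
VLEN = 16  # assumed lane count: 512-bit vector / 32-bit elements


def gather(array, indices):
    """Emulate a vector gather: read array[i] for every lane index i."""
    return [array[i] for i in indices]


def scatter(array, indices, values, mask):
    """Emulate a masked vector scatter: write values to array[indices] where mask is set."""
    for i, v, m in zip(indices, values, mask):
        if m:
            array[i] = v


def bfs_level(row_ptr, col_idx, parent, frontier):
    """Expand one BFS level over a CSR graph, up to VLEN neighbours at a time."""
    next_frontier = []
    for u in frontier:
        start, end = row_ptr[u], row_ptr[u + 1]
        for base in range(start, end, VLEN):
            neighbours = col_idx[base:min(base + VLEN, end)]  # contiguous load
            parents = gather(parent, neighbours)              # irregular reads
            mask = [p == -1 for p in parents]                 # unvisited lanes
            scatter(parent, neighbours, [u] * len(neighbours), mask)
            # dict.fromkeys drops duplicate neighbours within one vector
            newly = dict.fromkeys(v for v, m in zip(neighbours, mask) if m)
            next_frontier.extend(newly)
    return next_frontier


def bfs(row_ptr, col_idx, root):
    """Return the BFS parent array for the graph, starting from root."""
    parent = [-1] * (len(row_ptr) - 1)
    parent[root] = root
    frontier = [root]
    while frontier:
        frontier = bfs_level(row_ptr, col_idx, parent, frontier)
    return parent
```

The contiguous slice of `col_idx` maps to an ordinary vector load, while the reads and writes of `parent` hit arbitrary addresses, which is the access pattern that needs hardware gather and scatter.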

    DLP acceleration on general purpose cores

    High-performance and power-efficient multimedia computing drives the design of modern, increasingly utilized mobile devices. State-of-the-art low-power processors already utilize chip multiprocessors (CMP) that add dedicated DLP accelerators for emerging multimedia applications and 3D games. Such heterogeneous processors deliver the desired performance and efficiency at the cost of extra specialized hardware accelerators. In this paper, we propose dynamically-tuned vector execution (DVX), which morphs one or more available cores in a CMP into a DLP accelerator. DVX improves the performance and power efficiency of the CMP without the additional cost of dedicated accelerators.

    VALib and SimpleVector: Tools for rapid initial research on vector architectures

    Vector architectures have traditionally been applied to the supercomputing domain, with many successful incarnations. The energy efficiency and high performance of vector processors, as well as their applicability in other emerging domains, encourage further research on vector architectures. However, there is a lack of appropriate tools to perform this research. This paper presents two tools for measuring and analyzing an application's suitability for vector microarchitectures. The first tool is VALib, a library that enables hand-crafted vectorization of applications; its main purpose is to collect data for detailed instruction-level characterization and to generate input traces for the second tool. The second tool is SimpleVector, a fast trace-driven simulator that is used to estimate the execution time of a vectorized application on a candidate vector microarchitecture. The potential of the tools is demonstrated using six applications from emerging application domains such as speech and face recognition, video encoding, bioinformatics, machine learning and graph search. The results indicate that 63.2% to 91.1% of these contemporary applications are vectorizable. Then, over multiple use cases, we demonstrate that the tools can facilitate rapid evaluation of various vector architecture designs. The research leading to these results has received funding from the European Research Council under the European Union's 7th FP (FP/2007-2013) / ERC GA n. 321253. It has been partially funded by the Spanish Government (TIN2012-34557). Peer reviewed. Postprint (published version).
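As a rough illustration of what a trace-driven timing estimate involves, the sketch below walks an instruction trace and charges a cost per instruction. It is not SimpleVector's actual model: the trace format, the function names, and the cost formula (one cycle per scalar instruction, a fixed startup plus one cycle per group of `lanes` elements for a vector instruction) are all hypothetical.

```python
from math import ceil


def estimate_cycles(trace, lanes=4, startup=2):
    """Estimate cycles for a trace of (kind, vector_length) records.

    kind is "scalar" or "vector"; vector_length is ignored for scalars.
    The cost model is an assumption made for this sketch.
    """
    cycles = 0
    for kind, vector_length in trace:
        if kind == "scalar":
            cycles += 1
        else:
            cycles += startup + ceil(vector_length / lanes)
    return cycles


def vectorized_fraction(trace):
    """Share of element operations that are carried by vector instructions."""
    vector_ops = sum(vl for kind, vl in trace if kind == "vector")
    scalar_ops = sum(1 for kind, _ in trace if kind == "scalar")
    total = vector_ops + scalar_ops
    return vector_ops / total if total else 0.0
```

Re-running `estimate_cycles` on the same trace with different `lanes` or `startup` values is the kind of quick design comparison a trace-driven approach makes cheap, since the application itself does not need to be re-executed.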

    EVX: vector execution on low power EDGE cores

    In this paper, we present a vector execution model that provides the advantages of vector processors on low-power, general purpose cores with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases efficiency and hardware resource utilization. We use a modest dual-issue core based on an Explicit Data Graph Execution (EDGE) architecture to implement our approach, called EVX. Unlike most DLP accelerators, which utilize additional hardware and increase the complexity of low-power processors, EVX leverages the available resources of EDGE cores and allows for specialization of these resources at minimal cost. EVX adds control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x compared to a scalar baseline and outperforms multimedia SIMD extensions. © 2014 EDAA. Peer reviewed. Postprint (published version).

    Imposing coarse-grained reconfiguration to general purpose processors

    Mobile devices execute applications with diverse compute and performance demands. This paper proposes a general purpose processor that adapts the underlying hardware to a given workload. Existing mobile processors need to utilize more complex heterogeneous substrates to deliver the demanded performance. They incorporate different cores and specialized accelerators. In contrast, our processor utilizes only modest homogeneous cores and dynamically provides an execution substrate suitable for accelerating a particular workload. Instead of incorporating accelerators, the processor reconfigures one or more cores into accelerators on-the-fly. It improves performance with minimal hardware additions. The accelerators are made of general purpose ALUs reconfigured into a compute fabric and the general purpose pipeline that streams data through the fabric. To enable reconfiguration of ALUs into the fabric, the floorplan of a 4-core processor is changed to place the ALUs in close proximity on the chip. A configurable switched network is added to couple and dynamically reconfigure the ALUs to perform the computation of frequently repeated regions, instead of executing general purpose instructions. Through this reconfiguration, the mobile processor specializes its substrate for a given workload and maximizes the performance of the existing resources. Our results show that reconfiguration accelerates a set of selected compute-intensive workloads by 1.56×, 2.39× and 3.51× when configuring the accelerator from 1, 2 or 4 cores, respectively. Peer reviewed. Postprint (published version).
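The three reported speedups can be put on a common footing with a little arithmetic. The sketch below only post-processes the figures quoted in the abstract; the helper names are hypothetical and no additional measurements are implied.

```python
# Speedups quoted in the abstract, keyed by the number of cores
# configured into the accelerator.
REPORTED_SPEEDUP = {1: 1.56, 2: 2.39, 4: 3.51}


def per_core_speedup(speedups):
    """Speedup divided by the number of cores that form the accelerator."""
    return {cores: s / cores for cores, s in speedups.items()}


def step_gain(speedups):
    """Relative gain between consecutive configurations, smallest first."""
    cores = sorted(speedups)
    return {(a, b): speedups[b] / speedups[a] for a, b in zip(cores, cores[1:])}
```

With the quoted figures, each doubling of cores multiplies the speedup by roughly 1.53 (1 to 2 cores) and 1.47 (2 to 4 cores), and the speedup per core falls from 1.56 to about 0.88, so the accelerator scales sublinearly with the number of cores it absorbs.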
